Building Parallel Corpora by Automatic Title Alignment

نویسندگان

  • Christopher C. Yang
  • Kar Wing Li
چکیده

Cross-lingual semantic interoperability has drawn significant attention in recent digital library research as the digital libraries in languages other English has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish, and French, has been widely explored; however, CLIR across European languages and Oriental languages is still in the initial stage. To cross language boundary, corpus-based approach is promising to overcome the limitation of the knowledge-based and controlled vocabulary approaches but collecting parallel corpora between European language and Oriental language is not an easy task. Length-based and text-based approaches are two major approaches to align parallel documents. In this paper, we investigate several techniques using these approaches and compare their performances in aligning English and Chinese titles of parallel documents available on the Web.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Building and Using Parallel Resources for SMT from Comparable Corpora

Building parallel resources for corpus based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in the field Machine Translation research. In this paper, we propose an automatic approach for extraction of parallel fragments from comparable corpora. The comparable corpora are collected from Wikipedia documents and t...

متن کامل

Building parallel corpora by automatic title alignment using length-based and text-based approaches

Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research as the information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish, and French, has been widely explored; however, CLIR across European languages and Orienta...

متن کامل

Conceptual analysis of parallel corpus collected from the Web

As illustrated by the World Wide Web, the volume of information in languages other than English has grown significantly in recent years. This highlights the importance of multilingual corpora. Much effort has been devoted to the compilation of multilingual corpora for the purpose of cross-lingual information retrieval and machine translation. Existing parallel corpora mostly involve European la...

متن کامل

SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer

It is well known that word aligned parallel corpora are valuable linguistic resources. Since many factors affect automatic alignment quality, manual post-editing may be required in some applications. While there are several state-of-the-art word-aligners, such as GIZA++ and Berkeley, there is no simple visual tool that would enable correcting and editing aligned corpora of different formats. We...

متن کامل

Automatic Alignment in Parallel Corpora

This paper addresses the alignment issue in the framework of exploitation of large bimultilingual corpora for translation purposes. A generic alignment scheme is proposed that can meet varying requirements of different applications. Depending on the level at which alignment is sought, appropriate surface linguistic information is invoked coupled with information about possible unit delimiters. ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002